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Abstract 

The estimation of Bayesian networks given high-dimensional data, 
in particular gene expression data, has been the focus of much recent 
research. Whilst there are several methods available for the estima- 
tion of such networks, these typically assume that the data consist 
of independent and identically distributed samples. However, it is 
often the case that the available data have a more complex mean 
structure plus additional components of variance, which must then be 
accounted for in the estimation of a Bayesian network. In this pa- 
per, score metrics that take account of such complexities are proposed 
for use in conjunction with score-based methods for the estimation of 
Bayesian networks. We propose firstly, a fully Bayesian score metric, 
and secondly, a metric inspired by the notion of restricted maximum 
likelihood. We demonstrate the performance of these new metrics for 
the estimation of Bayesian networks using simulated data with known 
complex mean structures. We then present the analysis of expression 
levels of grape berry genes adjusting for exogenous variables believed 
to affect the expression levels of the genes. Demonstrable biological 
effects can be inferred from the estimated conditional independence 
relationships and correlations amongst the grape-berry genes. 

Bayesian network; complex mean structure; exogenous variable; grape- 
berry gene expression; regulatory network; score-based metric; variance com- 
ponents. 
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1 Introduction 



The inner workings of a cell are very complex, with many interacting com- 
ponents. Determining how the genes within a cell interact with each other 
is an important, but difficult, field of research, often requiring the appli- 
cation of advanced statistical methods. Systems of these gene interactions 
are known as genetic regulatory networks, and the extent to which such 
networks may be inferred from observational gene expression data remains 
largely undetermined. To explore this question carefully and quantitatively, 
high-dimensional multivariate models, including Bayesian networks, need to 
be considered. The use of Bayesian networks for the modelling of genetic 
regulatory networks has been discussed by several authors: see for example 
[31 El [18] . Their popularity lies in the provision of a flexible framework for 
the estimation of conditional dependence relationships, thereby providing a 
means to estimate a covariance matrix given a high- dimensional sample when 
maximum likelihood methods are unavailable, [5J. Estimation of such struc- 
tures allows insight into how the expression levels of large groups of genes 
are related to one another, which, in turn, should help shed light on genetic 
regulatory networks involving the genes. 

For the most part, it has been assumed that the data used to estimate the 
networks are independent and identically distributed. In the present paper, 
we consider the important case where the assumption of independent and 
identically distributed samples is not satisfied, and propose new methods to 
allow for the estimation of effects of interest given such complexity. Our 
theoretical development has been motivated by an observational time course 
microarray study, involving expression levels of grape-berry genes observed 
over time and known to be associated with changes in temperature. The 
grapes were sampled from three vineyards in different regions of southern 
Australia and data on the ambient temperatures during the times leading 
up to the picking of each sample of grapes was also measured. We want to 
investigate the conditional dependence structure of the genes, adjusting for 
the exogenous effects of temperature and vineyards, and we aim to do this 
through the estimation of a Bayesian network. If the effect of temperature 
is unaccounted for in the estimation of a Bayesian network for these genes, 
because of their common relationship with temperature, many pairs of genes 
will exhibit strong correlations. Unless the gross effects of vineyard and 
temperature are removed, one cannot hope to detect more subtle associations 
between genes. 

There are many methods available for the estimation of Bayesian networks 
given microarray and other high-dimensional datasets, and these may be di- 
vided into two broad categories, namely, score-based and constraint-based 
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methods, [22]. Score-based methods attempt to maximize some score met- 
ric associated with the estimated Bayesian network, whilst constraint-based 
methods estimate conditional independence relationships directly from the 
data, and combine these to form a Bayesian network. Constraint-based meth- 
ods test for conditional independence relationships, so the networks obtained 
through their application can be quite sensitive to Type I and Type II errors, 
particularly when the sample sizes are small. Score-based methods on the 
other hand are not as sensitive to small sample sizes, and instead of finding 
the best local structure for each node, find the best global structure given the 
data, often resulting in more parsimonious models. Given that gene expres- 
sion data sets tend to be high-dimensional with the attendant 'small n, large 
p 1 problem, we approach the problem of Bayesian network estimation from 
a score-based perspective, and extend these to include exogenous variables 
and dependent data. 

The outline of the paper is as follows. In Section [2j Bayesian networks and 
score metrics are briefly reviewed, and our two new score metrics for datasets 
with complex mean structure and random effects are presented. The new 
score metrics are used to estimate Bayesian networks for simulated datasets 
with a known complex mean structure in Section I3.2[ and then applied to 
the analysis of the grape-berry gene expression data in Section HI In Section 
[51 we present a brief summary of our overall findings. 

2 Bayesian networks and Score Metrics 

2.1 BGe, the basic Bayesian score metric 

Consider a random vector X = (Xi, . . . ,X p ) T . A Bayesian network B for 
X consists of two components: a directed acyclic graph G = (V, E) with 
V = {Xi, . . . , X p }, often written as V — {1, . . . ,p}, and assumed conditional 
distributions / (xj | Xp v 6i) ,i = 1, . . . ,p. The set Pi is the set of parents 
of Xi in G and = {9\, . . . ,8 p } is the set of parameters associated with 
the conditional distributions. The graph and conditional distributions then 
specify a joint distribution for X: 

p 

f(x\G,e) = l[f(x i \x Pi ,e i ). (!) 

i=l 

Bayesian networks encode information about the conditional independence 
relationships between the variables in X. The directed Markov properties, 
as described in [IT], for example, allow conditional independence statements 
about X to be read from the graph G. Additionally, when the available data 
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set is high-dimensional, and maximum likelihood estimation of the covariance 
matrix of X is unavailable, estimation of a Bayesian network allows the 
estimation of the covariance matrix. 

Here we consider a vector of pre-processed, normalized expression levels 
for p genes, and suppose that X ~ N p (0,H), where £ is unknown. We 
consider a data set d = {x%, . . . , x p }, where Xi = (xn, . . . , Xj n ) is the vector 
containing the n samples of the expression levels of gene i. The estimation 
of a Bayesian network for X given this data set d consists of learning the 
structure of a directed acyclic graph encoding the conditional independence 
relationships between the variables, and the estimation of the parameters 0. 

As described in the Introduction, a score-based approach to learning the 
structure of the graph encoding the conditional dependence relationships 
of X is taken. After deciding on a score metric, we want to find a graph 
that maximises that score metric, and an obvious choice is the likelihood 
function of directed acyclic graphs given the data. The maximum likelihood 
score turns out not to be a good idea, however, as it inevitably assigns 
the complete graph, encoding no conditional independence relationships, the 
maximum score. For more detail, readers are directed to Section 18.3 of [T5] . 

To avoid the problems with overfitting associated with the maximum 
likelihood score, the Bayesian score was developed. Following [9] among 
others, the Bayesian score metric for the estimation of a directed acyclic 
graph G given some data set d is proportional to the posterior probability of 
that graph: 

S(G | d) = p{G)f{d | G) = p(G) [ f(d | G, 0)/(e | G)dS. (2) 

Here the focus is on the second component of this score, the marginal model 
likelihood of the data given the graph G, where the density of d given G and 
is assumed to be an np-dimensional normal density, with mean vector 
and covariance matrix £ (g> I n - As per Equation (pQ), this joint density may 
be decomposed into a product of p conditional densities. When the data 
set d consists of independent and identically normally distributed samples, 
0j = {li,ipi}, and 

Xi | X Pt , 7i, Ipi ~ N n (Xp^i, tpJn) . 

Given normal- inverse gamma priors for each 9i, or an equivalent inverse 
Wishart prior on 0, the score metric of Equation ([2} can be written as the 
product of the prior density on the space of directed acyclic graphs, p(G), and 
p multivariate t densities. This score metric, only appropriate in the case of 
independent and identically distributed samples, is known as the BGe metric: 
"Bayesian metric for Gaussian networks with score equivalence". 
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2.2 BGeCM, the score metric for data sets with com- 
plex mean structure 

We now consider the case of a more complex data set d, that does not consist 
of independent and identically distributed samples, such as the grape-berry 
gene data set described in Section [TJ As explained there, contained within 
that data set is information about exogenous variables thought to affect the 
expression levels of the genes under study. Given such a data set, we now 
express the model for the vector of expression levels of gene i as 

Xi | x Pi , 7i, ipi, hi, (pi ~ N n {xp^i + Qh h ipj) , (3) 

where hi is the m- vector of the effects of the m exogenous variables on gene 
i, (pi are the parameters associated with the (as yet to be selected) prior 
distribution for hi, and Q is the nxm matrix containing the data associated 
with the m exogenous variables. It can be seen that in this specification, we 
retain linear dependence upon expression levels of parent genes, but now, in 
addition to that dependence, more complex sampling schemes and the influ- 
ence of exogenous variables are accounted for through the linear dependence 
of expression levels upon hi. 

Including exogenous variables in the estimation of a Bayesian network is 
important in order to obtain an unbiased estimate of the conditional depen- 
dence relationships between the genes of interest. For example, the expression 
levels of two genes may both be dependent upon changes in an exogenous 
variable, but conditionally independent of each other. If dependence upon 
exogenous variables is not accounted for, an edge between these two genes is 
likely to be present in an estimated graph. By accounting for the effects of 
exogenous variables, we can have more confidence that the conditional de- 
pendence relationships obtained represent actual dependence relationships, 
and are not due to common relationships with exogenous variables. 

As can be seen by the definition of the Bayesian score metric given by 
Equation (jSJ), a joint prior distribution for •y i ,ipi,bi and (pi is required for 
the calculation of a score metric. Care is required in the specification of 
this prior distribution, since if priors are not properly selected, a score met- 
ric that gives different scores to directed acyclic graphs encoding equivalent 
conditional independence restrictions will be induced. A score metric that 
does not discriminate between equivalent directed acyclic graphs is called an 
equivalent score metric. Discrimination between equivalent graphs is tanta- 
mount to assigning causal meaning to the directed edges of G, and since the 
emphasis here is on the estimation of graphs given observational data, the 
assignation of causal meaning to the estimated relationships is not appropri- 
ate. 
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Extension of the results in [10J to our model indicates that the joint prior 
distribution for / -f i ,ip i ,bi given fa must have a normal-inverse gamma form, 
and that the effects of the exogenous variables on one gene must be a priori 
independent of the effects upon another gene, for the induced score metric 
to satisfy equivalence. The following system of priors are used: 



There are several possible assumptions about the form of the prior distribu- 
tion of the variance of the random effects for gene i, f(fa). Among other 
choices, the variance of the random effects could be assumed known, a uni- 
form prior could be placed on y/fa, or an inverse gamma prior could be placed 
on fa. However, by an extension of the results in [10J, when 0j 7^ v~ 1 ipi, any 
choice of prior distribution on fa will result in a marginal model likelihood 
without a closed form, requiring numerical integration to compute and slow- 
ing down computations. 

A simulation study in [14J showed that the learnt network structure is 
quite robust to the misspecification of the prior density for fa, provided the 
magnitude of fa is correctly specified. Hence, for computational simplicity, 
the following prior for the variance of the effects of exogenous variables is 
used: 



where v is some positive parameter that is constant from gene to gene. If 
v — T, then hi and ji are independent and identically distributed. Taking 
v > t implies that the 6j are less variable than the 7^, while v < t implies 
that the hi are more variable than the 7^. If t; is taken to be very large, 
this is equivalent to assuming that the effects of exogenous variables do not 
contribute much to the overall variability of the expression levels of genes. 

Although it will be application dependent, it may be that the assump- 
tion the variance of the exogenous variables is related to the variance of the 
regression parameters in the same way for each gene is not be valid. In this 
situation, a separate Vi could be specified for each gene, but such specification 
would require information that is most probably unavailable. Alternatively, 
a hyperprior distribution could be placed upon the V{. However, any choice 
of such a distribution would lead to a score metric without an exact form, 
again requiring numerical integration to compute. 

When hi ~ N m (0, v~ l ipil), the marginal model likelihood for a particular 
random variable given its parents in the graph G, can be shown to be 




k\fa~ N m (0, fal) . 



h I fa~N m (0,v-%l) 
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with 

^xp. = S +\ P .\ { j ~ jxp * ( ri + x p 1 jx p*)~ 1 x p> j } » 

J = I-Q(vI + Q T QY l Q T . (4) 

We call the resultant score metric the BGeCM metric: "Bayesian metric for 
Gaussian networks having score equivalence for data sets with a complex 
mean structure". 

Posterior distributions of the parameters ji, ipi and b{ allow a detailed 
analysis of the relationships between random effects and the expression levels 
of the genes of interest. The posterior distributions of 73 given ipi, hi given 
ipi and ipi are given by 



ji I Xi,ipi,x Pi ~ N\ P .\ ({ri +Xp.Jx Pi ) 1 XpJXi,iJi(rI + XpJx Pt ) 1 
where J is as given in Equation (J3J). Further, 

b t I x^xp. ~ N m (l(yI + Q T J*Q)- 1 Q T rx i ,ip i (vI + Q T J*Qy^ , 



J* = I - x Pi (ri + Xp.x Pi ) 



- 1 JT 



X 



and 



ipi I x i ,xp i ~ Inv Gamma I - + ^- + - ,(3^ 



0A = l^ + \ x l { J - Jx p 1 {ri + x T Pi Jx Pi ) 1 x T p,j} 



Note that instead of using a score metric as developed above to allow 
for the inclusion of exogenous variables in the model, an extended directed 
acyclic graph could be learnt, where exogenous variables are included as 
vertices in the graph. There are however, a couple of difficulties presented 
by such an approach. The first is that if the exogenous variables are discrete, 
methods for Bayesian networks on both continuous and discrete variables 
are required. Additionally, many algorithms for learning directed acyclic 
graphs incorporate sparsity constraints, and if it is believed that many of the 
genes are affected by these exogenous variables, these sparsity constraints 
will require modification. 



2.3 Removal of random effects through analysis of resid- 
uals 

In the derivation of the BGeCM score metric, it was assumed that the effects 
of exogenous variables on gene expression were of intrinsic interest. However, 



7 



in many situations, the effects of exogenous variables can be thought of as 
nuisance variables, complicating the estimation of Bayesian networks for the 
given gene expression levels. It may be desirable to ignore the possible in- 
fluences of such effects upon gene expression levels, and on the relationships 
between genes. Of course, simply ignoring such effects is not recommended. 
Instead, we develop a non-parametric approach that adjusts for the effects 
of exogenous variables, without making assumptions about the form of their 
distributions. This approach, instead of directly using the gene expression 
data, is based upon the use of linear combinations of residuals left over after 
the data is regressed upon the effects of the exogenous variables. We call this 
the "residual approach", and it is inspired by the restricted maximum like- 
lihood procedure used in inference for mixed linear models; see for example 
Section 12.2 of [2J, or Speed [2TJ, which provides a good overview of REML. 

The utility of the residual approach is that it makes no assumptions 
about the distributional form of the random effects of interest. Since no 
such assumptions are made, the approach is correct no matter what the 
true distribution of the random effects may be. Hence, in situations when 
the assumption that 6, | 4>i ~ N m (0, v~ l ipil) is not satisfied, the residual 
approach provides a useful alternative to the BGeCM score metric, and, as 
we demonstrate below, is considerably easier to implement. 

We consider an (n — to) x 1 random variable yi = P T Xi, where P is an 
n x (n — to) matrix such that 

P T Q = 0, P T P = J n _ m , PP T = I n - Q(Q T Q)- 1 Q T . 

Hence, 

and the score metric associated with this set of marginal model likelihoods 
is invariant to the choice of P. Implementation of the residual approach to 
the estimation of Bayesian networks is therefore simple: after selection of an 
appropriate matrix P and computation of jji = P T x,i for i = 1, . . . ,n, the 
BGe score metric may be applied to this reduced data set in conjunction 
with the score-based method of choice. 

A drawback of the residual approach is that posterior estimates of the 
random effects h\ are not admitted. However, any potential loss of informa- 
tion about the underlying covariance matrix when the residual approach is 
used, compared to the 'full' BGeCM score metric, has been investigated in 
[13], and found to be typically small. 



8 



3 Numerical study of BGeCM and the resid- 
ual score metrics 



3.1 Implementation of BGeCM and the residual score 
metrics 

In this section, the necessity of score metrics that take account of complex 
mean structure are demonstrated through the application of the residual 
approach and the BGeCM score metric to simulated and real data sets. 

First, a note on implementation. The BGeCM score metric and the 
residual approach may be incorporated into any score-based algorithm for 
the estimation of Bayesian networks, without the need for any additional 
programming. In the case of the residual approach, all that is required is 
the calculation of the matrix P, satisfying the conditions in Equation ( 12. 3p . 
Then, instead of inputting d into the algorithm of choice, the augmented 
data set P T d is input. Similarly, when the BGeCM score metric is used, an 
augmented data set L T d will be the input into the algorithm, where L T is a 
matrix such that J = LL T , where J is as given in Equation (J4]). 

Here we apply the residual approach and BGeCM score metrics in con- 
junction with the high- dimensional Bayesian covariance selection algorithm, 
[I], a score-based method for the estimation of Bayesian networks. This 
algorithm works by constructing and combining regression models for each 
X t . 

3.2 Simulated data sets 

Example 1: In this first example, 10 data sets were generated according to 
the following system of linear recursive equations: 

X ijk = hj + e ijk , e< ~ N(0, ^) (i = 1, . . . , 100; j = 1, 2; k = 1, . . . , 50). 

The values of ipi were obtained by sampling from an Inverse Gamma(l, 1/2) 
distribution, and are constant for each of the samples generated. Similarly, 
b% = (bn, bi2) T ' , i = 1, • • • , 100, are fixed across data sets, obtained by sampling 
from 

fey ~JV(0,^) (i = 1, - - - , 100; J = 1,2), 

corresponding to v = 1. The non-zero mean structure of this example corre- 
sponds to two groups, and the true underlying directed acyclic graph is the 
empty graph. 
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Example 2: The system of linear recursive equations governing this ex- 
ample is 



Xik — qikbn + Q2kbi2 + Q3kbi3 + e 



ik 



1,...,18), 



-^19,fc — Qlkbl9,l + Q2kbl9,2 + <?3fcfrl9,3 + 7l9,l^lfc + 7l9,2-^2fc + ^19,*;, 
-^20,fc = <Zlfc&20,l + 92fc&20,2 + ?3fc&20,3 + 720,19^19,fc + e20,fc, 

e ife ~iV(0,^) (^ = 1,..., 20; fc = l,...,10). 

Ten data sets were generated according to this system of equations, and the 
parameters ^ (i = 1, ...,20), 719 = (719,1, 7i9,2) T and 720,19 were assumed 
constant across these data sets. The values of these parameters were obtained 
by sampling from the following distributions: 



ipi ~ Inv Gamma 



2 + \Pj\ 1 
2 ' 2 







1,...,18), IP, 



19 1 



20 1 



719 ~ N 2 (0, tpigl 2 ) , 720,19 ~ N(0, ^20) ■ 

Similarly, the random effects foj = (bn, bi 2 , &i3) T = 1, • • • , 20), were constant 
across the 10 data sets generated, obtained by sampling from 

b tJ ~N(0,^) (2 = 1,..., 20; J = 1,2,3), 

again corresponding to v — 1. 

The true model for each variable may be written as 

Xi I Vi, h ~ A^io (Q6i, ^10) (i = 1, . . . , 18), 

279 I 7i9» ^19, fo i9 ~ ^10 (^Pi 9 7i9 + Q&19, ^lgAo) , 
^20 I 720, ^20, b 2 o ~ N 10 (xp 20 7 20 + <5&2o, ipmho) , 



where 



Xi 



\ XilO J 
( X U X 2 ) , Xp 2i 



X19 
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Table 1: Mean and standard deviation of the number of spurious and cor- 
rect edges in the highest-scoring Bayesian networks obtained when the score 
metrics are applied to data sets simulated according to Examples 1 and 2 





Example 1 


Example 2 


Score Metric 


Spurious Edges 


Correct Edges Spurious Edges 


BGe 


117- 2(5- 25) 


1- 8(0- 63) 2- 4(1- 17) 


BGeCM 


1- 0(0- 94) 


2- 2(0- 92) 0- 4(0- 70) 


Residual Approach 


1- 7(1- 16) 


1- 1(0- 32) 0- 0(0- 00) 



and 



/ 9n 



Q 



qu 921 ?3i 



\ 9l,10 92,10 93,10 / 



V 



32 
22 
37 
53 
73 
92 
02 
27 
64 
15 



83 
37 
61 
52 
01 
87 
44 
59 
20 
48 



74 \ 
55 
60 
82 
93 
09 
04 
11 
21 
12/ 



In this case, elements of the Q matrix consist of random samples from the 
standard normal distribution, treated as known constants in the analysis, 
and the true underlying graph has three edges. 

Bayesian networks were estimated for each of the data sets generated ac- 
cording to Examples 1 and 2, using the BGe and BGeCM score metrics and 
the residual approach in conjunction with the High-dimensional Bayesian Co- 
variance Selection algorithm. After assessing the performance of the BGeCM 
score metric under ideal conditions, we assess the sensitivity of this metric 
to the misspecification of v. 

For each of these analyses, the number of spurious and correct edges in the 
highest-scoring network found by the algorithm was recorded. The results 
are summarized in Tables [T] and [2J Table [T] gives the mean and standard 
deviation of the number of spurious and correct edges in the highest- scoring 
Bayesian networks found when BGe, BGeCM and the residual approach, 
with v set at the correct value, that is v = 1, are used to estimate the true 
graph encoding the conditional independence relationships. Table [2] gives the 
numbers of correct and spurious edges when BGeCM is used given a range 
of values of v. 

Comparing the results obtained when the BGe score metric is used to 
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Table 2: Mean and standard deviation of the number of spurious and correct 
edges in the highest-scoring graphs found through the application of BGeCM 
for varying values of v, (standard deviation in brackets). 



Example 


V 

0-0001 0-001 0-01 0-1 1 2 


1 - Spurious 

2 - Spurious 
2 - Correct 

Example 


1 -4(0-97) 1-4(0-97) 1-4(0-97) 1-3(0-82) 1-0(0-82) 1-4(1-07) 

-9(1-10) 0-8(0-92) 1-0(0-82) 0-9(0-99) 0-4(0-70) 0-6(0-70) 

1 -9(0-57) 2-2(0-63) 2-2(0-63) 1-9(0-74) 2-2(0-92) 2-1(0-74) 

V 

5 10 20 50 100 1000 


1 - Spurious 

2 - Spurious 
2 - Correct 


3 -8(2-39) 15-8(3-46) 45-6(5-76) 84-6(4-17) 101-7(4-42) 119-4(2-88) 
-4(0-52) 1-1(0-88) 1-3(1-16) 2-1(0-88) 2-3(1-25) 2-2(1-48) 
2 -5(0-53) 2-1(0-74) 2-2(0-79) 2-0(0-82) 1-8(0-79) 2-0(0-82) 



analyse the simulated data sets demonstrates the utility of both the BGeCM 
score metric and the residual approach. The two new score metrics result 
in the estimation of structure which is much closer to the true structure. In 
addition, Table |2] shows that the results obtained from the BGeCM score 
metric are quite robust to the misspecification of v, producing accurate re- 
sults when the value of v selected departs as much as one or two orders of 
magnitude from its true value. As v gets larger, the highest scoring graphs 
obtained become more and more similar to those obtained when the BGe 
metric is used. This is a result of the fact that as v approaches oo, the limit 
of the BGeCM metric is the BGe metric, [13]. 



4 Analysis of the grape-berry microarray data 

The data analysed here consisted of 50 samples of gene expression levels for 
26 grape genes measured over a four-week period. The gene expression levels 
were derived from grape-berry tissue samples grown in three different vine- 
yards in three different wine-growing regions of southern Australia. Twenty 
samples were taken from a vineyard in Clare, 20 from the Wingara Vine- 
yard in Mildura and 10 from a vineyard in Willunga. Table [3] provides the 
reference numbers for the 26 grape genes, together with a brief summary of 
their functions. All the genes in Table [3] are known to code for heat shock 
proteins (HSPs), [21], which are responsible for protecting the grapes against 
heat-induced stress. In addition to data on the gene expression levels, tem- 
perature in degrees celcius was recorded during the time leading up to the 
picking of the grapes. 
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These data are part of a larger dataset on grape-berry tissue samples 
measured between 2003 and 2005. At each vineyard, grapes were sampled 
roughly weekly over the period of development of the berries, i.e., from the 
time buds formed on the vines, to the time when the grapes were ripe. In gen- 
eral, grape berries follow a double sigmoidal pattern of growth that consists 
of two distinct growth phases, with a lag period between these phases (see 
[U US])- The second stage of grape-berry growth commences upon the occur- 
rence of veraison, when the grape berries start to change colour. Robinson 
and Davies, |19j . suggest that at veraison and during ripening, there are many 
changes in the expression levels of many different genes in grape berries. The 
observed developmental time period from bud formation to grape ripeness 
differed between vineyards in the present study. The shorter four-week time 
period we analysed occurred after fruit set, but well before veraison for all 
three vineyards. We restricted attention to samples corresponding to the 
third to seventh sampling weeks at each vineyard because the relationships 
between genes are thought to be more stable during this period, and the mod- 
elling assumption of identically distributed samples is therefore more likely 
to be valid. 

mRNA expression levels for each of the grape tissue samples was mea- 
sured using Affymetrix Vitis vinifera oligonucleotide arrays. Background 
subtraction and normalisation was carried out using robust microarray anal- 
ysis (RMA), as described in Irizarry et al, [TTJ. Note that all samples were 
processed at the same laboratory. 

Understanding the stress tolerance mechanisms of plants is important, 
and the heat shock protein network, as discussed by [16] and [24], is very 
complex. The heat shock protein network of plants is believed to consist 
of interactions between small Hsps, Hsp60, Hsp70, Hsp90 and HsplOO, [24J. 
Precisely how Hsps interact with one another and how they protect against 
heat stress is not yet completely understood, and here we seek to gain some 
insight into the heat shock protein network by examining the conditional 
dependence structure of the genes given in Table [3] 

Given the known functions of the genes considered in this study and 
the climatic and geographic disparities between the regions where the grape 
berries were sampled, it would incorrect to ignore the effects of vineyard and 
temperature in the estimation of a Bayesian network for the grape genes. 
The essential point is that if the expression levels of these genes are strongly 
influenced by these exogenous variables, then accounting for variation due 
to such variables in the estimation of a Bayesian network should result in 
a network that more accurately encodes the dependence structure of the 
genes. A further important point is that given the grape gene expression 
levels analysed are observational data, causal interpretations should not be 
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applied to the directed edges present in any network estimated given the data. 
Hence, moralized versions of directed acyclic graphs are used to summarize 
conditional independence relationships of the grape genes. 

To begin, the initial (null) model omitted the effects of vineyard and 
temperature on the expression levels of the genes. That is, if Xi is the 50- 
vector of the expression levels for grape gene i, it is assumed that 

Xi | arp.,7i, V>» ~ N 50 (x Pz 7^/50) , 
ji I ipi ~ iV| Pi |(0, 7-"tyiJ|Pi|)> ~ Ga 

where xp i is a 50 x \Pi\ matrix. The columns of this matrix consist of the 
expression levels of the grape genes in the dataset that the expression level of 
gene i is dependent upon, 73 = (7ij)j 6 p. an d Jij is the effect of the expression 
level of gene j on the expression level of gene i. Following the analysis of 
Affymetrix gene expression data in [I], r — 1 and 5 = 2. 

The highest scoring Bayesian network found through the application 
of the high-dimensional Bayesian covariance selection algorithm to the full 
dataset (ignoring the exogenous variables of vineyard and temperature) has 
55 edges, and the moralized version has 130 edges. The moralized version is 
shown in Figure [T^a). 

Next, graphs were estimated separately for each of the three vineyards. 
The highest-scoring directed acyclic graphs obtained for the Clare, Wingara 
and Willunga vineyards had, respectively, 22, 23 and 17 edges. These graphs 
were quite different from one another, with the three graphs having only 
two edges in common, the Wingara and Clare graphs sharing eight edges, 
and the Willunga graph having three edges in common with the Wingara 
and Clare graphs. Given the paucity of the data and the complexity of the 
models, this lack of concordance between the graphs obtained separately for 
each vineyard is not surprising. 

In order to make more efficient use of the data, models incorporating data 
from all three vineyards simultaneously were then considered. 

The question of how best to include temperature and vineyard effects in 
the model for gene expression was investigated using linear regression models 
with forward and backward selection. The largest model fitted for each gene 
contains separate intercepts for the data from each vineyard, and terms for 
each of the temperatures recorded 30, 90, 150, 210, 270 and 330 minutes before 
the grapes were picked. We also considered the model including vineyard and 
temperature main effects and two-way temperature interactions. For the full 
model with interactions, it was observed that the adjusted R 2 of many of 
the regressions was above 0.99, indicating that some over-fitting was taking 
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(a) 




Figure 1: The moral versions of the highest-scoring graphs obtained for the 
grape genes when (a) the effects of temperature and vineyard are ignored, 
and when the residual approach is used to include (b) vineyard effects, and 
(c) vineyard effects and main temperature effects. 



place. We therefore exclude two-way temperature interaction effects in what 
follows. 

Results of the stepAIC function in R, [23], indicate that there is not a 
single backwards elimination step that would apply to all genes. That is, 
each of the vineyard or temperature variables is significant in at least one of 
the 26 regression models estimated. In any case, use of separate regression 
models for each gene is beyond the scope of the present score metrics. As 
such, the largest model considered is as follows: 



where bi = (bn, . . . , big) and b^ (j = 1, 2, 3) is the effect of vineyard j on the 
expression level of gene i, bi^, and . . . , big are the temperature effects. 

Histograms of the marginal standard deviations of the expression data for 
each gene, and the residual standard errors from the regressions containing 
vineyard, and vineyard and temperature as covariates, are shown in Figure [2j 
Note that there are three genes with very small standard deviations. These 
plots show that vineyard variables account for only some of the variation in 
the gene expression levels. Changes in temperature and vineyard account 
for much more of the observed variation, but there remains some residual 
variation to be explained. On the basis of these histograms, we expect that 
the graph obtained when only vineyard is included as an exogenous variable 
will be somewhat similar to that obtained when the exogenous variables are 
ignored, whilst we would expect to see a reasonably different structure when 
both vineyard and temperature are accounted for. 

In accounting for the effects of vineyard and temperature, we find high- 
scoring Bayesian networks using the BGeCM score metric, first fitting the 
model with vineyard effects only, then the model with vineyard and main 
temperature effects. The highest-scoring Bayesian networks found for v = 
0.5, 1 and 10 were recorded and their moral graphs summarized in Table HI 
It can be seen that as more covariates are included in the model, more of the 
variation in the expression levels of the grapes is explained, and the highest- 
scoring graphs obtained have fewer edges. Edges that are removed as more 
exogenous variables are included in the model can be interpreted as being 
explained by common relationships of genes with these additional covariates. 

The BGeCM score metric assumes that the effects of the exogenous vari- 
ables are independent and identically distributed, an assumption that must 
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Marginal Standard Deviations 



(a) Standard errors of expression levels 
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Residual Standard Deviations 

(b) Residual standard errors after regressing expression levels on vineyard 
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(c) Residual standard errors after regressing expression levels on vineyard 

and temperature 

Figure 2: Histograms of the marginal^ standard deviations of the grape gene 
expression levels and the residual standard errors after regressing the expres- 
sion levels on vineyard only, then vineyard and temperature. 



be questioned. The effects of temperature and vineyard are almost certainly 
not iid. However, there is little information available to provide a useful es- 
timate of the covariance structure of the effects of these exogenous variables. 
Therefore, the residual approach, which makes no assumptions about the 
covariance structure of the effects included in the model, is preferred here. 

The number of edges in the moralized versions of the highest-scoring net- 
works obtained using the residual method are summarized in Table HI When 
only the effects of vineyards are included in the model, the results obtained 
using the residual method are similar to those obtained when the BGeCM 
score is used, as expected on the basis of the histograms in Figure [2] When 
the effects of temperature are included in the model, the residual method pro- 
duces high-scoring graphs with fewer edges than the BGeCM score metric. 
This indicates that whilst the BGeCM score metric may account correctly 
for the covariance structure of the effects of vineyard, the effects of tempera- 
ture may have a more complicated variance structure, that is not adequately 
modelled by the iid assumption. 

The moralized graphs obtained from the residual method are displayed in 
Figure [H These graphs, drawn using Graph Viz, [5J, show that as more of the 
variation in gene expression due to exogenous sources is accounted for in the 
model, the moral graphs of the highest-scoring networks obtained have fewer 
edges. The graph obtained by including both temperature and vineyard as 
exogenous variables, Figure Q](c), is preferable to that obtained when only 
vineyard is included, Figure Q](b). For most genes, very little variation in 
the gene expression values is accounted for by the relationship with vineyard 
alone. 

There are a number of interesting features to be observed in graph Fig- 
ure H](c), which is the graph obtained when both vineyard and temperature 
effects are accounted for in the model. We observe that seven nodes in this 
graph are completely disconnected from all other nodes, which implies that 
once relationships with temperature and vineyard have been accounted for, 
the expression levels of each of these genes are independent of the expression 
levels of all other genes. (Recall that absence of an edge between two genes 
in Figure (He) indicates that the expression levels of these nodes are inde- 
pendent, a relationship which can be refined through application of Markov 
properties.) 

It is apparent that three of these seven disconnected nodes, corresponding 
to genes 14, 18 and 23, are already disconnected from the rest of the graph 
when only vineyard is included in the model; see Figure d^b), where it is 
observed that these are the only three unconnected nodes. The expression 
levels of these three genes are observed to have the lowest standard deviations 
of all the genes, at 0.037, 0.034 and 0.068 respectively, and are in fact the 
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three genes at the left-most end of the rug in Figure E^a). When these three 
genes are regressed on vineyard, the residual standard deviations are even 
smaller (0.029, 0.023 and 0.047). In other words, there is no variation in the 
expression levels of these genes to be explained by relationships with other 
genes, so it is not surprising that they are unconnected in the graph. Very 
small gene standard deviations can be problematic in microarray data anal- 
ysis, and methods have been proposed for adjusting the standard deviation 
estimates upwards by adding a constant term or by application of empirical 
Bayes methods when constructing t-tests, for example, (20]. Such adjustment 
is beyond the scope of our present analysis however. Note that the fact the 
three genes are connected in Figure QJa) is suggestive of overfitting when the 
BGe metric is used. 

Genes 9, 10 and 11 are also disconnected in the final graph, Figure (He), 
and correspond to Hsp81, which is an early response to dehydration. Ac- 
cording to the KEGG data base, [12], they are predicted to be similar to 
Hsp90. The role of these sets of genes in the heat shock network of grapes is 
not entirely understood. The role of Hsp81 proteins in Arabidopsis thaliana, 
more commonly known as thale cress, has been discussed in [23], who note 
that an increase in the expression level of Hsp81-1 is possibly caused by a reg- 
ulatory pathway other than the heat shock pathway. Our analysis supports 
this finding for Vitis Vinifera, indicating that Hsp81 may not be implicated 
in the heat shock protein network of grapes, at least over the four-week time 
period studied. We have established that variation in the Hsp81 genes is ac- 
counted for directly by the effects of the exogenous variables, and that they 
are uncorrelated with the other HSPs in the final graph. 

The seventh gene, number 26, which is a mitochondrial small Hsp, is not 
implicated in the final network either. This gene is the only mitochondrial 
gene considered. That it is unconnected from the rest of the network indicates 
that variation in this gene is explained purely by exogenous temperature and 
vineyard effects, and is not dependent upon any of the other genes in the 
dataset. This suggests that the mitochondrial HSP are not regulated in the 
same way as other cellular HSPs. 

On the whole, relatively little is known about the heat shock regulatory 
network for grapes. Typically in the representation of the heat response net- 
work for plants, relationships between classes of genes, such as small Hsps 
or Hsp 70s are discussed, [24J. The graph obtained here provides a good 
starting point for the development of a finer structure, which can then be 
further developed. The edges between the genes in Figure QJc) can be in- 
terpreted as encoding conditional dependence relationships. This graph is 
the moralized version of the directed acyclic graph found, and more detail is 
available through consideration of the class partially directed acyclic graph, 
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or class PDAG, of the underlying directed acyclic graph. However, since we 
are analysing observational data, we will consider here edges in the moral- 
ized version of the graph only. We observe that there are two pairs of genes 
connected by a single edge. The first pair of nodes is (12,19), represent- 
ing a chaperone gene (gene 12) which promotes the folding and unfolding 
of proteins and a class I small HSP (gene 19). This is an undirected edge, 
but the two genes are correlated after adjusting for the exogenous variables, 
and the chaperone gene 12 has no other connecting edges. The second pair 
of connected nodes is (2,8), representing a glucose regulated HSP (gene 2) 
and the ubiquitin conjugating enzyme 4e (gene 8); again these two genes are 
correlated after adjusting for the exogenous variables, and the enzyme gene 
8 has no other connecting edges. 

Further investigation of the connected nodes and possible regulatory heat 
shock mechanisms is beyond the scope of the present paper, and would re- 
quire further biological evidence and possible investigation. It is clear from 
the detection of the unconnected nodes together with the plausible relation- 
ships between nodes connected by a single edge, that analysis of the data 
using the new score metrics has demonstrable utility and has detected real 
biological effects. 

5 Discussion 

The BGeCM score metric and the residual approach presented in this pa- 
per enable Bayesian network structures to be learnt given datasets that do 
not consist of independent and identically distributed samples, and may be 
used in conjunction with any score-based method for the estimation of a 
Bayesian network. Furthermore, the residual approach allows the estimation 
of a Bayesian network for datasets with a complex mean structure without 
the need to specify the variance structure of the mean effects. This approach 
proved useful for the analysis of the grape-berry gene data, where it could not 
reasonably be supposed that the effects of the exogenous variables were in- 
dependent and identically distributed. Our analysis of the grape-berry gene 
microarray data has resulted in biologically plausible conclusions on the heat 
shock regulatory network of grape genes. These inferences could not have 
been drawn without the availability of suitable score metrics to account for 
the effects of exogenous variables. 
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Table 3: The grape heat shock genes. The first column gives the gene refer- 
ence numbers used in this study, the second column gives the Affymetrix ref- 
erence numbers, and the third column gives the National Center for Biotech- 
nology Information reference numbers. The fourth column provides a short 
description of the function of the genes. Note that HSP stands for heat shock 



protein. 








Ref # 


Affymetrix # 


NCBI # 


Protein Identity 


1 


1616246^,t 


Vvi.9142 


Heat shock protein 70, ATP binding 


2 


1607002jat 


Vvi.4801 


Heat shock protein 70, ATP binding, 
luminal binding protein, glucose regulated 


3 


1610684^at 


Vvi 2869 


chloroplast HSP 70-1, ATP binding 


4 


1611740^at 


Vvi.295 


unknown 


5 


1620985^,t 


Vvi.4530 


HSP21 chloroplast 


6 


1616995jat 


Vvi.23518 




7 

8 


1614132^at 
1618265jat 


Vvi.863 
Vvi. 15427 


Ubiquitin conjugating enzyme 4e 


9 


1608052j3_at 


Vvi.9085 




10 


1618009.at 


Vvi.9085 


HSP81 (early response to dehydration) 


11 


1619931 J3_at 


Vvi. 7394 




12 


1608701^at 


Vvi.2083 


10 kDa chapcronin 


13 


1608164jat 


Vvi.6787 




14 


1611052jat 


Vvi.6787 




15 


1611192jat 


Vvi.6787 


Cytosolic class II 17.6 HSP 


16 


1610032.at 


Vvi.6787 




17 


1614330^at 


Vvi.6787 




18 


1620956jat 


Vvi.3921 




19 


1616538^at 


Vvi.7869 




20 


1609554.at 


Vvi. 7044 


17.6 kDa class I small HSP 


21 


1620960 Ji_at 


Vvi. 7044 




22 


1621652jat 


Vvi.4464 




23 
24 


1622165jat 
1612385 jat 


Vvi.6156 
Vvi.4422 


17.4kDa class I small HSP 


25 


1622628.at 


Vvi.5040 


17.4kDa class III small HSP 


26 


1610700.at 


Vvi.2537 


23. 6K mitochondrial small HSP 
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Table 4: Number of edges in the moral graphs associated with the highest- 
scoring Bayesian networks for 3 different sets of included exogenous variables, 
for the BGeCM score with varying values of v, where bi ~ N(0, v -1 ^), and 
for the residual method. 





BGeCM 






V 


Residual 


Included Covariates 


0.5 1 10 




Vineyard 


66 76 89 


68 


Vineyard and temperature 


57 63 68 


41 
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